```python
# for each point, calculate the distance to the centroids
# and assign it to the closest centroid
for point in zip(x, y):
    distances = []
    for centroid in assignments.keys():
        distances.append(distance(point, centroid))
    assignments[get_closest_centroid(distances, centroids)].append(point)
```
```python
# compute the new centroids
new_centroids = []
for centroid, points in assignments.items():  # assignments: dict[centroid, points]
    new_centroids.append((get_x_mean(points), get_y_mean(points)))
```
K-means can also be used to get the boundaries of the Voronoi regions.
This is the criterion for separating samples into K groups of equal variance: we minimize the "inertia", also known as the within-cluster sum-of-squares.
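As a sketch of what is being minimized, the inertia can be computed directly from the assignment structure used in the code above (`assignments`, mapping each centroid to its points, is an assumed name):

```python
# minimal sketch: inertia = sum of squared distances of each point
# from the centroid of the cluster it is assigned to
def inertia(assignments):
    # assignments: dict mapping each centroid (tuple) to its list of points
    total = 0.0
    for centroid, points in assignments.items():
        for p in points:
            total += sum((pi - ci) ** 2 for pi, ci in zip(p, centroid))
    return total
```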
There are only two steps at which k-means changes its values: the assignment step (each point goes to its closest centroid) and the update step (each centroid moves to the mean of its points).
Neither of those two operations ever increases the value of the loss; at worst it stays the same.
Tries to minimize the spread of each cluster.
K-means is always guaranteed to converge (i.e., to terminate), at least to a local optimum.
How do we choose the initial centroids effectively?
We choose the first one arbitrarily from the data points.
Each subsequent one is placed as far away as possible from all the centroids placed so far.
```python
# this is not strict Python and it is not efficient; it is only for understanding
centroids = [(random_x, random_y)]
for k in range(K - 1):
    # for each point, get its closest centroid and the distance from it
    closest = [(point, min(centroids, key=lambda c: distance(point, c)))
               for point in points]
    # the new centroid is the point with the max distance from its closest centroid
    new_centroid = max(closest, key=lambda pc: distance(pc[0], pc[1]))[0]
    centroids.append(new_centroid)
```
To select a new centroid:
This approach has a big problem, though: if there are any outliers, the centroids will always end up placed on them.
We have an outlier problem.
We no longer take the max() of the distances from the nearest centroids: if there is an outlier, doing so makes a centroid stabilize on that outlier.
We don't like this because the outlier could take out a whole possible cluster.
So instead of taking the max (which would surely pick the outlier), every point has a probability of being the new centroid.
The probability of each point of being selected is proportional to the distance from their nearest centroid.
The higher this distance is, the higher probability the point has of being the new centroid.
randomness(K-means++) > randomness(K-means + Furthest-first), because with Furthest-first we select at random only at the start; all the rest is deterministic. With K-means++, there is stochasticity involved at each centroid selection.

But how do we actually sample from a set of distances?
This is the probability mass function, which gives the probability that the random variable takes the value d'. In other words, it maps each possible value of the random variable to its probability.
Where:
This is not entirely rigorous, but it is good enough for our purposes.
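A sketch of the sampling step: `random.choices` can draw a point with probability proportional to its weight, so we weight each point by the distance from its nearest centroid (the helper names here are illustrative, not from the notes):

```python
import random

def dist(a, b):
    # Euclidean distance between two points given as tuples
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def sample_next_centroid(points, centroids):
    # weight of each point = distance from its nearest centroid,
    # so far-away points are more likely to be picked
    weights = [min(dist(p, c) for c in centroids) for p in points]
    return random.choices(points, weights=weights, k=1)[0]
```

A point sitting exactly on a centroid gets weight 0 and can never be picked again.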
By choosing the type of norm that we use to compute distances, we can produce a whole family of algorithms.
We want the k that gives a large gap between the cost function of (k-1)-means and that of k-means.
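A sketch of that selection rule, assuming we have already run k-means once per candidate k and stored the losses in a list (names are illustrative):

```python
def best_k(costs):
    # costs[k] holds the k-means loss for k clusters (costs[0] is unused)
    gaps = {k: costs[k - 1] - costs[k] for k in range(2, len(costs))}
    # pick the k where the cost dropped the most compared to k-1 clusters
    return max(gaps, key=gaps.get)
```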
We can't simply use the loss function (that would be overfitting: the loss keeps decreasing as k grows).
We pay a penalty depending on k (the number of clusters).
You are allowed to change the number of clusters.
We construct a new loss function:
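The notes don't spell out the formula; a common form (an assumption here, with λ a penalty weight we choose) adds a term proportional to k on top of the usual within-cluster cost:

```latex
\mathcal{L}(k) = \underbrace{\sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2}_{\text{within-cluster cost}} + \lambda k
```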
We want to compress the colors in an image to make it smaller on the disk:
Look at the color space as a 3D space with axes R, G and B:
You cluster the colors with k-means, and you get the centroids:
When you get k colors, every pixel becomes a pointer to one of those colors. If you have 16 colors, each pixel takes just 4 bits.
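A sketch of the storage idea (the palette and pixel indices below are toy values, not from a real image; in practice the palette would come from the k-means centroids): with 16 palette colors each pixel index fits in 4 bits, so two pixels pack into one byte.

```python
def pack_indices(indices):
    # pack 4-bit palette indices (0-15), two pixels per byte
    if len(indices) % 2:
        indices = list(indices) + [0]  # pad to an even count
    return bytes((hi << 4) | lo for hi, lo in zip(indices[::2], indices[1::2]))

palette = [(16 * i, 16 * i, 16 * i) for i in range(16)]  # toy 16-color palette
pixel_indices = [3, 15, 0, 7]   # each pixel is just a pointer into the palette
packed = pack_indices(pixel_indices)  # 4 pixels -> 2 bytes
```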
Nowadays, image recognition is done through neural networks, but before that we did it like this:
You cluster the patches together (using k-means), then the centroids become the patch tokens (the "visual words").
We take patches of the image and build a histogram of them: how many patches fall into each cluster.
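A sketch of the histogram step (toy patches and centroids below; real patches would be flattened pixel vectors):

```python
from collections import Counter

def dist2(a, b):
    # squared Euclidean distance between two vectors
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def patch_histogram(patches, centroids):
    # assign each patch to its nearest centroid, then count hits per centroid
    counts = Counter(min(range(len(centroids)), key=lambda j: dist2(p, centroids[j]))
                     for p in patches)
    return [counts.get(j, 0) for j in range(len(centroids))]
```

The resulting fixed-length vector can then be fed to any standard classifier.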